Setup

In [8]:
import pandas as pd
import numpy as np
import sweetviz as sv
In [9]:
data = pd.concat([pd.read_csv("../data/clean/train.csv"),
                  pd.read_csv("../data/clean/test.csv")]).reset_index(drop=True)

Data profiling

In [10]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72739 entries, 0 to 72738
Data columns (total 43 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_response_rate                            65965 non-null  float64
 1   host_acceptance_rate                          66597 non-null  float64
 2   latitude                                      72739 non-null  float64
 3   longitude                                     72739 non-null  float64
 4   accommodates                                  72739 non-null  int64  
 5   bathrooms                                     72718 non-null  float64
 6   bedrooms                                      72567 non-null  float64
 7   beds                                          72515 non-null  float64
 8   price                                         72739 non-null  float64
 9   minimum_nights                                72739 non-null  int64  
 10  maximum_nights                                72739 non-null  int64  
 11  availability_30                               72739 non-null  int64  
 12  availability_60                               72739 non-null  int64  
 13  availability_90                               72739 non-null  int64  
 14  availability_365                              72739 non-null  int64  
 15  number_of_reviews                             72739 non-null  int64  
 16  number_of_reviews_ltm                         72739 non-null  int64  
 17  number_of_reviews_l30d                        72739 non-null  int64  
 18  review_scores_rating                          57901 non-null  float64
 19  review_scores_accuracy                        57856 non-null  float64
 20  review_scores_cleanliness                     57855 non-null  float64
 21  review_scores_checkin                         57853 non-null  float64
 22  review_scores_communication                   57857 non-null  float64
 23  review_scores_location                        57851 non-null  float64
 24  review_scores_value                           57852 non-null  float64
 25  calculated_host_listings_count                72739 non-null  int64  
 26  calculated_host_listings_count_entire_homes   72739 non-null  int64  
 27  calculated_host_listings_count_private_rooms  72739 non-null  int64  
 28  calculated_host_listings_count_shared_rooms   72739 non-null  int64  
 29  reviews_per_month                             57901 non-null  float64
 30  host_is_superhost_flag                        72739 non-null  int64  
 31  host_has_profile_pic_flag                     72739 non-null  int64  
 32  host_identity_verified_flag                   72739 non-null  int64  
 33  has_availability_flag                         72739 non-null  int64  
 34  instant_bookable_flag                         72739 non-null  int64  
 35  host_email_verified_flag                      72739 non-null  int64  
 36  host_phone_verified_flag                      72739 non-null  int64  
 37  host_work_email_verified_flag                 72739 non-null  int64  
 38  host_response_time                            65965 non-null  object 
 39  property_type                                 72739 non-null  object 
 40  room_type                                     72739 non-null  object 
 41  state                                         72739 non-null  object 
 42  city                                          72739 non-null  object 
dtypes: float64(16), int64(22), object(5)
memory usage: 23.9+ MB
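
The non-null counts above imply that some columns have sizeable gaps. For example, review_scores_rating has 57901 of 72739 values present, so roughly 20% are missing. A minimal sketch of how these gaps could be tabulated follows. The helper name `missing_summary` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, largest first.

    Hypothetical helper: columns without missing values are dropped.
    """
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })
    return summary[summary["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )


# Toy stand-in for `data`; the values are made up for illustration only
toy = pd.DataFrame({
    "price": [100.0, 250.0, 80.0, 120.0],
    "review_scores_rating": [4.8, np.nan, np.nan, 5.0],
    "bedrooms": [1.0, 2.0, np.nan, 1.0],
})

summary = missing_summary(toy)
print(summary)
```

On the real frame, `missing_summary(data)` would list the review-score and host-response columns first.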
In [11]:
# Numerical features
data.describe(include=['int64','float64'],
              percentiles=[0.01]+np.arange(0.1,1,0.1).tolist()+[0.99]).T
Out[11]:
count mean std min 1% 10% 20% 30% 40% 50% 60% 70% 80% 90% 99% max
host_response_rate 65965.0 95.893019 15.489104 0.000000 0.000000 92.000000 99.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.00000
host_acceptance_rate 66597.0 88.525504 21.428897 0.000000 0.000000 64.000000 83.000000 92.000000 96.000000 98.000000 99.000000 100.000000 100.000000 100.000000 100.000000 100.00000
latitude 72739.0 34.002513 2.997901 30.097450 30.196625 30.286110 32.734058 32.796360 33.771102 33.993103 34.047783 34.086027 34.139844 41.764362 41.967421 42.02220
longitude 72739.0 -110.288256 11.438353 -118.917134 -118.640056 -118.444862 -118.373404 -118.314606 -118.189719 -117.892080 -117.168598 -97.784765 -97.704548 -87.752817 -87.616778 -87.52842
accommodates 72739.0 4.575757 3.185178 1.000000 1.000000 2.000000 2.000000 2.000000 3.000000 4.000000 4.000000 6.000000 6.000000 8.000000 16.000000 16.00000
bathrooms 72718.0 1.635276 1.109639 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.500000 2.000000 2.000000 3.000000 6.000000 50.00000
bedrooms 72567.0 1.862568 1.377400 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 3.000000 4.000000 6.000000 50.00000
beds 72515.0 2.439716 2.165706 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 3.000000 4.000000 5.000000 10.000000 132.00000
price 72739.0 283.583016 670.223217 5.000000 32.000000 63.000000 87.000000 110.000000 132.000000 159.000000 194.000000 240.000000 319.000000 500.000000 2103.720000 56425.00000
minimum_nights 72739.0 13.177745 21.906856 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 3.000000 4.000000 30.000000 30.000000 31.000000 60.000000 1000.00000
maximum_nights 72739.0 454.880229 404.497425 1.000000 7.000000 28.000000 60.000000 180.000000 365.000000 365.000000 365.000000 365.000000 1125.000000 1125.000000 1125.000000 3650.00000
availability_30 72739.0 15.386931 11.162401 0.000000 0.000000 0.000000 3.000000 6.000000 11.000000 15.000000 20.000000 25.000000 29.000000 30.000000 30.000000 30.00000
availability_60 72739.0 35.211702 20.586669 0.000000 0.000000 2.000000 13.000000 23.000000 31.000000 37.000000 45.000000 53.000000 58.000000 60.000000 60.000000 60.00000
availability_90 72739.0 58.210712 28.757990 0.000000 0.000000 9.800000 31.000000 45.000000 56.000000 64.000000 73.000000 82.000000 88.000000 90.000000 90.000000 90.00000
availability_365 72739.0 222.487386 113.363258 0.000000 4.000000 61.000000 91.000000 148.000000 180.000000 244.000000 270.000000 318.000000 345.000000 363.000000 365.000000 365.00000
number_of_reviews 72739.0 47.677656 92.212599 0.000000 0.000000 0.000000 0.000000 2.000000 5.000000 11.000000 22.000000 39.000000 70.000000 137.000000 439.000000 3689.00000
number_of_reviews_ltm 72739.0 12.210217 19.235970 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 3.000000 7.000000 14.000000 23.000000 37.000000 79.000000 666.00000
number_of_reviews_l30d 72739.0 1.043498 1.791507 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 2.000000 3.000000 7.000000 60.00000
review_scores_rating 57901.0 4.793315 0.374320 1.000000 3.000000 4.500000 4.690000 4.790000 4.850000 4.900000 4.940000 4.980000 5.000000 5.000000 5.000000 5.00000
review_scores_accuracy 57856.0 4.816045 0.354837 1.000000 3.000000 4.560000 4.740000 4.820000 4.880000 4.920000 4.950000 4.990000 5.000000 5.000000 5.000000 5.00000
review_scores_cleanliness 57855.0 4.772888 0.382629 1.000000 3.000000 4.470000 4.670000 4.760000 4.830000 4.880000 4.930000 4.970000 5.000000 5.000000 5.000000 5.00000
review_scores_checkin 57853.0 4.865768 0.318956 1.000000 3.500000 4.670000 4.820000 4.890000 4.930000 4.960000 4.980000 5.000000 5.000000 5.000000 5.000000 5.00000
review_scores_communication 57857.0 4.869514 0.326593 1.000000 3.500000 4.680000 4.830000 4.900000 4.940000 4.970000 5.000000 5.000000 5.000000 5.000000 5.000000 5.00000
review_scores_location 57851.0 4.797704 0.340652 1.000000 3.330000 4.500000 4.690000 4.790000 4.850000 4.900000 4.940000 4.980000 5.000000 5.000000 5.000000 5.00000
review_scores_value 57852.0 4.714665 0.405423 0.000000 3.000000 4.400000 4.600000 4.690000 4.760000 4.810000 4.860000 4.900000 4.970000 5.000000 5.000000 5.00000
calculated_host_listings_count 72739.0 20.362172 67.831619 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 3.000000 5.000000 9.000000 17.000000 44.000000 549.000000 569.00000
calculated_host_listings_count_entire_homes 72739.0 18.411609 67.650574 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 2.000000 3.000000 6.000000 13.000000 38.000000 549.000000 569.00000
calculated_host_listings_count_private_rooms 72739.0 1.412076 5.581708 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 3.000000 27.000000 89.00000
calculated_host_listings_count_shared_rooms 72739.0 0.217751 2.588695 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000 61.00000
reviews_per_month 57901.0 1.783405 1.869215 0.010000 0.030000 0.140000 0.290000 0.510000 0.850000 1.230000 1.760000 2.330000 3.050000 4.120000 7.570000 56.46000
host_is_superhost_flag 72739.0 0.444562 0.496921 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
host_has_profile_pic_flag 72739.0 0.972848 0.162527 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
host_identity_verified_flag 72739.0 0.906735 0.290805 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
has_availability_flag 72739.0 0.990761 0.095673 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
instant_bookable_flag 72739.0 0.324668 0.468254 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.00000
host_email_verified_flag 72739.0 0.912867 0.282032 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
host_phone_verified_flag 72739.0 0.999381 0.024865 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
host_work_email_verified_flag 72739.0 0.145232 0.352337 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.00000
In [12]:
# Categorical features
data.describe(include=['object']).T
Out[12]:
count unique top freq
host_response_time 65965 4 within an hour 53510
property_type 72739 33 entire home 21773
room_type 72739 4 entire home/apt 58660
state 72739 3 california 48950
city 72739 5 los angeles 37296

Descriptive analysis

The Sweetviz report below, built on a 10% random sample of the data, summarizes each feature (distribution, missing values, and basic statistics) and shows its relationship with the target variable, price. The detailed summary of the target also ranks the numerical and categorical features by the strength of their association with it.

In [13]:
report = sv.analyze(data.sample(frac=0.1, random_state=123), target_feat='price')
report.show_notebook()
Done! Use 'show' commands to display/save.   |██████████| [100%]   00:03 -> (00:00 left)        
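
As a rough cross-check on the report's numerical associations, one could rank the numeric features by their Pearson correlation with the target directly in pandas. This is only a sketch: the helper name `numeric_corr_with_target` and the toy frame are assumptions, and the measure is not necessarily the one Sweetviz displays.

```python
import pandas as pd


def numeric_corr_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`,
    ordered by absolute value (strongest first). Hypothetical helper."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=target).corrwith(numeric[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy stand-in for `data`; the values are made up for illustration only
toy = pd.DataFrame({
    "accommodates": [1, 2, 3, 4, 5, 6],
    "minimum_nights": [3, 1, 2, 1, 3, 2],
    "price": [60, 95, 160, 190, 260, 300],
    "room_type": ["private room"] * 3 + ["entire home/apt"] * 3,
})

corr = numeric_corr_with_target(toy, target="price")
print(corr)
```

Non-numeric columns such as room_type are skipped by `select_dtypes`, so the categorical associations still have to be read from the report.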

Since price is heavily right-skewed (median 159, mean about 284, maximum 56,425), let's apply a log10 transformation and re-check its relationship with the predictor variables.

In [14]:
# Transform price using log10 (all prices are positive: the minimum is 5)
data_log_price = data.copy()
# pop() removes 'price' and returns it, so only 'log_price' remains
data_log_price['log_price'] = np.log10(data_log_price.pop('price'))

report = sv.analyze(data_log_price.sample(frac=0.1, random_state=123), target_feat='log_price')
report.show_notebook()
Done! Use 'show' commands to display/save.   |██████████| [100%]   00:02 -> (00:00 left)
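
To check numerically that a log10 transformation tames a long right tail, one could compare the skewness before and after. The sketch below uses synthetic log-normal prices, not the listings data, so the numbers only illustrate the effect. The distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)

# Synthetic right-skewed prices (log-normal); NOT the real data
toy_price = pd.Series(
    rng.lognormal(mean=5.0, sigma=0.8, size=5000), name="price"
)
toy_log_price = np.log10(toy_price)

# Sample skewness: about 0 for a symmetric distribution,
# clearly positive for a long right tail
skew_raw = toy_price.skew()
skew_log = toy_log_price.skew()
print(f"skew(price) = {skew_raw:.2f}, skew(log10 price) = {skew_log:.2f}")
```

On the real frame, the analogous comparison would be `data['price'].skew()` against `data_log_price['log_price'].skew()`.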